Skip to content

MINOR: [C++][Parquet] Reject too-short statistics values in FormatStatValue#50025

Open
jmestwa-coder wants to merge 2 commits into
apache:mainfrom
jmestwa-coder:stat-value-short-bound
Open

MINOR: [C++][Parquet] Reject too-short statistics values in FormatStatValue#50025
jmestwa-coder wants to merge 2 commits into
apache:mainfrom
jmestwa-coder:stat-value-short-bound

Conversation

@jmestwa-coder
Copy link
Copy Markdown

FormatStatValue in cpp/src/parquet/types.cc memcpys sizeof(physical_type) bytes from a string_view that comes straight out of the Thrift statistics block, so a crafted file with min_value or max_value shorter than the column's width (for example, a zero-byte stat on an INT96 column) reads past the buffer. The check belongs in this callee since callers in printer.cc just forward whatever the metadata holds. New ASSERT_THROW coverage in types_test.cc exercises the short-value paths for BOOLEAN, INT32, INT64, FLOAT, DOUBLE, INT96, Decimal and Float16.

Comment thread cpp/src/parquet/types.cc
break;
case Type::BYTE_ARRAY:
case Type::FIXED_LEN_BYTE_ARRAY:
if (logical_type != nullptr && logical_type->is_float16()) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not dealing with the length of FIXED_LEN_BYTE_ARRAY?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FormatStatValue doesn't receive the column's type_length, so it can't validate the full FLBA width here. The only fixed-size read for FLBA is the 2-byte float16 load, which is what this case guards. The decimal, string and hex paths all key off val.size() and stay in bounds.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit that it is better than nothing to check types with known sizes. It still looks confusing with this partial support. If we do want to add the check, I think we need to support all types. We can either pass the type instance instead of only the enum or pass the type length to FormatStatValue. The challenge is that FormatStatValue is a public API so we cannot changing its signature without breaking backward compatibility.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, we need to create an issue for changes like this.

Comment thread cpp/src/parquet/types.cc Outdated
case Type::INT96:
required = 3 * sizeof(int32_t);
break;
case Type::BYTE_ARRAY:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to put BYTE_ARRAY and UNDEFINED together since they are skipped. In case you don't know, BYTE_ARRAY cannot have float16 logical type.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Moved BYTE_ARRAY down with UNDEFINED since it can't carry a float16 logical type and has no fixed-width load here anyway.

Comment thread cpp/src/parquet/types.cc
default:
break;
}
if (val.size() < required) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little hesitant to complicate FormatStatValue by introducing this check. FormatStatValue is usually called in the printer when we want to print some information of a parquet file for debugging purpuse.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, it's mostly the debug/printer path. The thing is val is taken verbatim from the file's Thrift stats, so a truncated min/max makes the BOOLEAN/INT96/numeric memcpys read past the end of the buffer even there. The guard is just a size check in front of the existing loads, so I tried to keep it minimal. Happy to trim it further if it still feels too heavy.

@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants